
Drastic performance improvements for reads (#249) #342

Merged (12 commits into adamreeve:master, Dec 11, 2024)

Conversation

@johannesloibl (Contributor) commented Dec 9, 2024

Possible solution for #249.

See my comment here.
Partially reading a file with thousands of signals can take a huge amount of time (reading 100k signals in slices took 7000 s).

I managed to greatly improve the performance, by at least 12x (on 100k signals) and up to 25x (on 10k signals), through:

  • caching (a minimal sketch of the idea follows this list)
  • skipping unneeded iterations over all data objects
  • skipping unneeded file seeks
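For illustration, here is a minimal sketch of the kind of caching and early-exit lookup meant above. The class and attribute names (`SegmentIndex`, `_object_index_cache`, `obj.path`) are hypothetical and are not the actual npTDMS internals.

```python
# Illustrative sketch only, not the npTDMS implementation: cache where a
# channel's data object sits inside a segment so that repeated partial reads
# of the same channel do not rescan the whole object list every time.
class SegmentIndex:
    def __init__(self, data_objects):
        self._data_objects = data_objects
        self._object_index_cache = {}  # hypothetical: channel path -> list index

    def find_object(self, channel_path):
        cached_index = self._object_index_cache.get(channel_path)
        if cached_index is not None:
            return self._data_objects[cached_index]
        for index, obj in enumerate(self._data_objects):
            if obj.path == channel_path:
                self._object_index_cache[channel_path] = index
                return obj  # stop as soon as the channel is found
        raise KeyError(channel_path)
```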

What is still a problem: the reading time still scales as O(n²) with the number of channels and chunks, because a channel has to be found by iterating through the list of data objects AND chunks.
If I read only 10k signals, the improvement factor is currently ~25x; for 100k signals it's only ~12x.

You can see that most of the time is now spent in the _read_channel_data_chunk and _get_channel_number_values functions. Once the channel is found, the search is ended (before, it kept iterating to the end). This means that with many signals, the time to find the desired signal still grows.
We need to find a way to calculate the channel's position directly in order to greatly speed this up,
but I don't know the internals of the chunking well enough to do this.
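
A back-of-the-envelope model of why the scaling stays roughly quadratic (my own illustration, not npTDMS code): if the i-th channel is only found after scanning about i data objects, reading every channel once costs on the order of n²/2 scans in total, which is consistent with the improvement factor shrinking as the channel count grows.

```python
# Rough cost model: finding the i-th channel scans ~i objects, so one partial
# read per channel costs about n*(n+1)/2 object scans in total.
def total_scans(n_channels):
    return n_channels * (n_channels + 1) // 2

print(total_scans(10_000))   # 50,005,000
print(total_scans(100_000))  # 5,000,050,000 -> 10x more channels, ~100x more work
```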

Before: [profiling screenshot]

With my changes: [profiling screenshot]

@johannesloibl marked this pull request as ready for review on December 9, 2024, 22:25
@adamreeve (Owner) left a comment

Thanks for the contribution @johannesloibl, this looks great. I've left some review comments to be addressed

Review comments on nptdms/tdms_segment.py (resolved)
@adamreeve (Owner) commented Dec 10, 2024

> Partially reading a file with thousands of signals can take a huge amount of time (reading 100k signals in slices took 7000 s).

Just to add some more context and make sure I'm understanding this right, can you provide an example of how you're reading a file? Are you reading a subset of the channels and for each channel, reading all data at once?

@johannesloibl (Contributor, Author) commented Dec 10, 2024

> Partially reading a file with thousands of signals can take a huge amount of time (reading 100k signals in slices took 7000 s).
>
> Just to add some more context and make sure I'm understanding this right, can you provide an example of how you're reading a file? Are you reading a subset of the channels and for each channel, reading all data at once?

We are reading subsets of each channel. The data is lab measurement data.
For each of N (~1-20) different operating conditions, either a scalar measurement or an array is appended to each TDMS channel. So the final length of an array channel with 1000 points per iteration and 100 iterations would be 100k. We have M (~1-100) channels.
We now want to read out the data for each operating condition separately, which means that for each operating condition we read M channels partially. This is where the number of read operations explodes.
I don't want to read all channels fully at once, because I want the reader to also work in memory-restricted, scalable cloud environments.
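
For context, a minimal example of this read pattern using npTDMS's streaming API (TdmsFile.open plus slice reads); the file path, block size, and condition count below are made up for illustration.

```python
from nptdms import TdmsFile

BLOCK_SIZE = 1000    # points appended per operating condition (illustrative)
N_CONDITIONS = 100   # number of operating conditions (illustrative)

# Stream the file instead of loading it fully, so memory stays bounded
# even in restricted cloud environments.
with TdmsFile.open("measurements.tdms") as tdms_file:  # path is illustrative
    for group in tdms_file.groups():
        for channel in group.channels():
            for condition in range(N_CONDITIONS):
                start = condition * BLOCK_SIZE
                # One small partial read per channel and operating condition;
                # it is the sheer number of these reads that made the old
                # lookup cost explode.
                block = channel[start:start + BLOCK_SIZE]
```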

If you open this file with the TDMS Excel plugin, the cover page has 50k rows :D
[screenshot of the TDMS Excel plugin cover page]

@johannesloibl (Contributor, Author) commented

Do you see a way to skip iterating through the data_objects and instead calculate the desired file offset directly?
This would be another major speed improvement for partial reads.

Review comment on nptdms/base_segment.py (resolved)
@adamreeve (Owner) commented

> Do you see a way to skip iterating through the data_objects and instead calculate the desired file offset directly?
> This would be another major speed improvement for partial reads.

It should be possible to compute them when reading the segment metadata, although that might add a bit more memory overhead. It also wouldn't be needed for the case where you read all data up front, so we'd probably want to disable that behaviour in that case.
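
An illustrative sketch of that idea (my own sketch, not the actual implementation): once the segment metadata is read, a running sum over the objects' raw data sizes gives each channel's byte offset inside the segment's data block, assuming a contiguous (non-interleaved) layout. The attribute names below are hypothetical.

```python
# Illustrative sketch, not npTDMS code: precompute each object's byte offset
# within a segment's raw data so a partial read can seek straight to a channel
# instead of iterating over all preceding data objects.
def compute_data_offsets(data_objects):
    """Map each object's path to its byte offset in the segment's raw data.

    Assumes contiguous (non-interleaved) data and that every object exposes a
    ``path`` and a ``data_size`` in bytes (hypothetical attribute names).
    """
    offsets = {}
    running_offset = 0
    for obj in data_objects:
        offsets[obj.path] = running_offset
        running_offset += obj.data_size
    return offsets
```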

@johannesloibl (Contributor, Author) commented

Ready to merge from my point of view ;) Thanks for the quick support!
Glad my runtime dropped from 1.5 h to 4 min :D

@adamreeve merged commit 7d4e0b9 into adamreeve:master on Dec 11, 2024 (14 checks passed)
@adamreeve (Owner) commented

Thanks for the contribution!
